Executive Summary

This project analyzes housing price data from King County, Washington, to build predictive models for home values. The analysis includes comprehensive data cleaning, exploratory data analysis, linear regression modeling, and logistic regression for home quality classification.

Data Cleaning and Preprocessing

This section includes steps to identify and fix data entry errors and transform problematic variables in the King County housing dataset.

Load and Inspect the Data

# Load packages and the King County housing dataset
library(dplyr); library(ggplot2); library(scales)
library(gridExtra); library(leaps)
data <- read.csv("kc_house_data.csv", header = TRUE)

# Display the first few rows
head(data)
##           id            date   price bedrooms bathrooms sqft_living sqft_lot
## 1 7129300520 20141013T000000  221900        3      1.00        1180     5650
## 2 6414100192 20141209T000000  538000        3      2.25        2570     7242
## 3 5631500400 20150225T000000  180000        2      1.00         770    10000
## 4 2487200875 20141209T000000  604000        4      3.00        1960     5000
## 5 1954400510 20150218T000000  510000        3      2.00        1680     8080
## 6 7237550310 20140512T000000 1225000        4      4.50        5420   101930
##   floors waterfront view condition grade sqft_above sqft_basement yr_built
## 1      1          0    0         3     7       1180             0     1955
## 2      2          0    0         3     7       2170           400     1951
## 3      1          0    0         3     6        770             0     1933
## 4      1          0    0         5     7       1050           910     1965
## 5      1          0    0         3     8       1680             0     1987
## 6      1          0    0         3    11       3890          1530     2001
##   yr_renovated zipcode     lat     long sqft_living15 sqft_lot15
## 1            0   98178 47.5112 -122.257          1340       5650
## 2         1991   98125 47.7210 -122.319          1690       7639
## 3            0   98028 47.7379 -122.233          2720       8062
## 4            0   98136 47.5208 -122.393          1360       5000
## 5            0   98074 47.6168 -122.045          1800       7503
## 6            0   98053 47.6561 -122.005          4760     101930
# Check data structure
str(data)
## 'data.frame':    21613 obs. of  21 variables:
##  $ id           : num  7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
##  $ date         : chr  "20141013T000000" "20141209T000000" "20150225T000000" "20141209T000000" ...
##  $ price        : num  221900 538000 180000 604000 510000 ...
##  $ bedrooms     : int  3 3 2 4 3 4 3 3 3 3 ...
##  $ bathrooms    : num  1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
##  $ sqft_living  : int  1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
##  $ sqft_lot     : int  5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
##  $ floors       : num  1 2 1 1 1 1 2 1 1 2 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 3 3 5 3 3 3 3 3 3 ...
##  $ grade        : int  7 7 6 7 8 11 7 7 7 7 ...
##  $ sqft_above   : int  1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
##  $ sqft_basement: int  0 400 0 910 0 1530 0 0 730 0 ...
##  $ yr_built     : int  1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
##  $ yr_renovated : int  0 1991 0 0 0 0 0 0 0 0 ...
##  $ zipcode      : int  98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
##  $ lat          : num  47.5 47.7 47.7 47.5 47.6 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
##  $ sqft_lot15   : int  5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
# Check for missing values
cat("Missing values per column:\n")
## Missing values per column:
colSums(is.na(data))
##            id          date         price      bedrooms     bathrooms 
##             0             0             0             0             0 
##   sqft_living      sqft_lot        floors    waterfront          view 
##             0             0             0             0             0 
##     condition         grade    sqft_above sqft_basement      yr_built 
##             0             0             0             0             0 
##  yr_renovated       zipcode           lat          long sqft_living15 
##             0             0             0             0             0 
##    sqft_lot15 
##             0

Identify Potential Data Entry Errors

In this section, we identify observations with clearly incorrect or suspicious values: homes with zero or implausibly many bedrooms or bathrooms, zero square footage, and unusual price or year values.

Bedroom and Bathroom Anomalies

# Check for problematic bedroom and bathroom values
problem_rows <- data[
  (data$bedrooms == 0 | data$bedrooms == 33 | data$bathrooms == 0) &
  !is.na(data$bedrooms) & !is.na(data$bathrooms),
  c("id", "bedrooms", "bathrooms", "sqft_living", "price", "zipcode")
]

cat("Properties with suspicious bedroom/bathroom values:\n")
## Properties with suspicious bedroom/bathroom values:
print(problem_rows)
##               id bedrooms bathrooms sqft_living   price zipcode
## 876   6306400140        0      0.00        3064 1095000   98102
## 1150  3421079032        1      0.00         670   75000   98022
## 3120  3918400017        0      0.00        1470  380000   98133
## 3468  1453602309        0      1.50        1430  288000   98125
## 4869  6896300380        0      1.00         390  228000   98118
## 5833  5702500050        1      0.00         600  280000   98045
## 6995  2954400190        0      0.00        4810 1295650   98053
## 8478  2569500210        0      2.50        2290  339950   98042
## 8485  2310060040        0      2.50        1810  240000   98038
## 9774  3374500520        0      0.00        2460  355000   98031
## 9855  7849202190        0      0.00        1470  235000   98065
## 10482  203100435        1      0.00         690  484000   98053
## 12654 7849202299        0      2.50        1490  320000   98065
## 14424 9543000205        0      0.00         844  139950   98001
## 15871 2402100895       33      1.75        1620  640000   98103
## 18380 1222029077        0      0.75         384  265000   98070
## 19453 3980300371        0      0.00         290  142000   98024
cat("\nFound", nrow(problem_rows), "properties with extreme bedroom/bathroom values")
## 
## Found 17 properties with extreme bedroom/bathroom values

Manual Data Corrections

Based on external verification using King County Parcel Viewer, we apply targeted corrections to properties with verified data entry errors:

# Apply manual corrections based on external verification
corrections <- list(
  list(id = 6306400140, bedrooms = 5, bathrooms = 4.50),
  list(id = 3421079032, bedrooms = 3, bathrooms = 3.75),
  list(id = 3918400017, bedrooms = 3, bathrooms = 2.25),
  list(id = 6896300380, bedrooms = 3, bathrooms = NA), # bedroom only
  list(id = 2954400190, bedrooms = 4, bathrooms = 4),
  list(id = 2569500210, bedrooms = 4, bathrooms = NA),
  list(id = 2310060040, bedrooms = 4, bathrooms = NA),
  list(id = 7849202190, bedrooms = 3, bathrooms = 1.50),
  list(id = 7849202299, bedrooms = 0, bathrooms = NA), # verified studio
  list(id = 9543000205, bedrooms = 2, bathrooms = 1),
  list(id = 2402100895, bedrooms = 3, bathrooms = NA), # was 33 bedrooms
  list(id = 1222029077, bedrooms = 1, bathrooms = 1.50),
  list(id = 3374500520, bedrooms = 4, bathrooms = 3.5)
)

# Apply corrections
for (correction in corrections) {
  if (!is.na(correction$bedrooms)) {
    data[data$id == correction$id, "bedrooms"] <- correction$bedrooms
  }
  if (!is.na(correction$bathrooms)) {
    data[data$id == correction$id, "bathrooms"] <- correction$bathrooms
  }
}

# Remove unverifiable records
remove_ids <- c(5702500050, 203100435, 3980300371)
data <- data[!data$id %in% remove_ids, ]

cat("Applied corrections to", length(corrections), "properties")
## Applied corrections to 13 properties
cat("\nRemoved", length(remove_ids), "unverifiable records")
## 
## Removed 3 unverifiable records
cat("\nRemaining records:", nrow(data))
## 
## Remaining records: 21610

As part of the data cleaning process, we found 17 properties with unusual or clearly incorrect values for bedrooms and bathrooms. Some had 0 bedrooms or 0 bathrooms, and one even listed 33 bedrooms, which is obviously unrealistic. Instead of deleting these rows right away, we looked each one up manually in the King County Parcel Viewer to verify what the correct values should be. For most of them, we were able to confirm the actual number of rooms and made the necessary corrections. For instance, one home listed with 0 bedrooms and 0 bathrooms was corrected to 4 bedrooms and 4 bathrooms after verification, and the home listed with 33 bedrooms was updated to 3, which is far more plausible given its size. One property turned out to be a verified studio-style layout, so its 0-bedroom value was kept as-is. In three cases, we could find no record of the property; since the information could not be confirmed and looked suspicious, we removed those rows. Overall, this process let us clean the dataset carefully, using outside sources to make informed decisions rather than relying on assumptions or automatic removals.

Check for Duplicate Property IDs

# Count homes that share a property id (repeat sales)
dup_count <- sum(duplicated(data$id))
cat(dup_count, "homes with duplicate IDs\n")
## 177 homes with duplicate IDs
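These duplicate IDs are repeat sales of the same property, and each sale is a valid observation, so the rows are retained below. If one record per property were needed instead, a sketch like the following (shown on a made-up toy frame, not applied to the dataset) would keep only the most recent sale per id:

```r
# Sketch (not applied above): keep only the most recent sale per id.
# Works because 'date' strings like "20141013T000000" sort chronologically.
dedupe_latest <- function(df) {
  df <- df[order(df$id, df$date), ]          # sort sales within each id
  df[!duplicated(df$id, fromLast = TRUE), ]  # keep the last (latest) sale
}

# Toy illustration with made-up rows
toy <- data.frame(
  id    = c(1, 1, 2),
  date  = c("20140101T000000", "20150101T000000", "20140601T000000"),
  price = c(100, 120, 200)
)
dedupe_latest(toy)  # id 1 keeps its 2015 sale; id 2 is unchanged
```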

Living Area and Lot Size Check for Zero

We check for homes with sqft_living == 0 or sqft_lot == 0. A house cannot have 0 square feet of living space or land area.

data[data$sqft_living == 0 | data$sqft_lot == 0, ]
##  [1] id            date          price         bedrooms      bathrooms    
##  [6] sqft_living   sqft_lot      floors        waterfront    view         
## [11] condition     grade         sqft_above    sqft_basement yr_built     
## [16] yr_renovated  zipcode       lat           long          sqft_living15
## [21] sqft_lot15   
## <0 rows> (or 0-length row.names)

No homes in the dataset have 0 for sqft_living or sqft_lot.

Price Check for Zero or Negative Values

The price variable should always be a positive number. A home with a price of 0 or less is invalid for this dataset, and such rows should be removed.

summary(data$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   75000  322000  450000  540121  645000 7700000
data[data$price <= 0, ]
##  [1] id            date          price         bedrooms      bathrooms    
##  [6] sqft_living   sqft_lot      floors        waterfront    view         
## [11] condition     grade         sqft_above    sqft_basement yr_built     
## [16] yr_renovated  zipcode       lat           long          sqft_living15
## [21] sqft_lot15   
## <0 rows> (or 0-length row.names)

There are no homes with a price of 0 or less, so no cleaning was necessary for this variable.

Year Built & Year Renovated Check for Future Years

We check for homes with yr_built or yr_renovated in the future (after 2015, the latest sale year in the data).

data[data$yr_built > 2015 | data$yr_renovated > 2015, ]
##  [1] id            date          price         bedrooms      bathrooms    
##  [6] sqft_living   sqft_lot      floors        waterfront    view         
## [11] condition     grade         sqft_above    sqft_basement yr_built     
## [16] yr_renovated  zipcode       lat           long          sqft_living15
## [21] sqft_lot15   
## <0 rows> (or 0-length row.names)

The dataset contains no homes built or renovated after 2015.

Feature Engineering

Date Feature Extraction

The date column is a character string and not directly useful for modeling, so we extract the year and month of sale.

data$year_sold <- as.numeric(substr(data$date, 1, 4))
data$month_sold <- as.numeric(substr(data$date, 5, 6))

table(data$year_sold)
## 
##  2014  2015 
## 14630  6980
table(data$month_sold)
## 
##    1    2    3    4    5    6    7    8    9   10   11   12 
##  978 1250 1875 2231 2414 2180 2211 1940 1772 1878 1410 1471

Geographic Region Classification

zipcode has over 70 unique values, which is too many for regression modeling: encoding it would require roughly 70 dummy variables, cluttering the model and increasing the risk of overfitting. We instead group zip codes into three broader regions.

city_zips <- c(98101, 98102, 98104, 98105, 98109, 98112, 98115, 98116, 98118, 98119, 98121, 98122, 98125, 98126, 98133, 98134,98136, 98144, 98154, 98164, 98174, 98195)

suburb_zips <- c(98004, 98005, 98006, 98007, 98008, 98027, 98029, 98033,98034, 98040, 98052, 98053, 98056, 98057, 98059, 98072,98074, 98075, 98092, 98070, 98028, 98019)

rural_zips <- setdiff(unique(data$zipcode), union(city_zips, suburb_zips))

data$region <- case_when(
  data$zipcode %in% city_zips ~ "City",
  data$zipcode %in% suburb_zips ~ "Suburb",
  data$zipcode %in% rural_zips ~ "Rural"
)

data$region <- factor(data$region, levels = c("City", "Suburb", "Rural"))

table(data$region)
## 
##   City Suburb  Rural 
##   4471   7266   9873

4,471 homes are in the city, 7,266 in the suburbs, and 9,873 in rural areas.

Grouping Homes by Renovation Recency

data$renovation_group <- case_when(
  data$yr_renovated == 0 ~ "Never Renovated",
  data$yr_renovated >= 2005 ~ "Recently Renovated",
  TRUE ~ "Renovated Long Ago"
)
data$renovation_group <- factor(data$renovation_group)
table(data$renovation_group)
## 
##    Never Renovated Recently Renovated Renovated Long Ago 
##              20696                320                594

20,696 homes were never renovated, 320 were renovated recently (2005 or later), and 594 were renovated longer ago.

Convert Waterfront to a Factor

data$waterfront <- factor(data$waterfront)

Transform Latitude & Longitude into Distance to Downtown and compare distance with condition

data$distance_to_downtown <- sqrt(
  (data$lat - 47.6062)^2 + (data$long + 122.3321)^2
)

summary(data$distance_to_downtown)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01311 0.09646 0.17881 0.20076 0.28468 1.02269
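The distance above is in decimal degrees, which treats a degree of latitude and a degree of longitude as covering the same ground distance even though they differ at Seattle's latitude. The unit is fine for modeling, but for interpretation a rough planar conversion to kilometers can help. This is an illustrative sketch; the 111.2 km-per-degree constant and the cosine correction are standard approximations, not part of the analysis above:

```r
# Approximate ground distance in km from downtown Seattle (47.6062, -122.3321).
# One degree of latitude spans ~111.2 km; a degree of longitude shrinks by
# cos(latitude). A rough planar approximation, adequate at city scale.
deg_to_km <- function(lat, long, ref_lat = 47.6062, ref_long = -122.3321) {
  km_per_deg_lat  <- 111.2
  km_per_deg_long <- 111.2 * cos(ref_lat * pi / 180)
  sqrt(((lat - ref_lat) * km_per_deg_lat)^2 +
       ((long - ref_long) * km_per_deg_long)^2)
}

deg_to_km(47.5112, -122.257)  # first listing: roughly 12 km from downtown
```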

Check for multicollinearity

# Check for multicollinearity among square footage variables
sqft_vars <- data[, c("sqft_living", "sqft_above", "sqft_basement", "sqft_living15")]
cor_matrix <- round(cor(sqft_vars), 2)
cor_matrix
##               sqft_living sqft_above sqft_basement sqft_living15
## sqft_living          1.00       0.88          0.43          0.76
## sqft_above           0.88       1.00         -0.05          0.73
## sqft_basement        0.43      -0.05          1.00          0.20
## sqft_living15        0.76       0.73          0.20          1.00
# sqft_above and sqft_basement sum to sqft_living, so they are redundant; drop them
data$sqft_above <- NULL
data$sqft_basement <- NULL

Exploratory Data Analysis

This section contains visualizations that explore how price is related to the other factors.

# Split data into training and testing sets; only the training data is used for visualizations
set.seed(1)
sample <- sample.int(nrow(data), floor(0.80 * nrow(data)), replace = FALSE)
train <- data[sample, ]
test <- data[-sample, ]

Density plot for price

# distribution of price
ggplot(train, aes(x=price))+
  scale_x_continuous(breaks = breaks_extended(6),labels = label_dollar())+
  theme(plot.title = element_text(hjust=.5))+
  labs(x="Price", y="Density", title = "Distribution of Price")+
  geom_density()

Boxplot of Price

ggplot(train, aes(x= "", y=price))+
  geom_boxplot()+
  scale_y_continuous(breaks = breaks_extended(10),labels = label_dollar())+
  theme(plot.title = element_text(hjust=.5))+
  labs(x=NULL, y="Price", title = "Summary of Price")

# Boxplot statistics
summary(train$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   75000  324866  450000  540458  645500 7700000

Bar chart of Average Price & Boxplot of Price by Number of Bedrooms

#bar chart and box plot that show how # of bedrooms are related to price
ggplot(train, aes(x=bedrooms, y=price))+
  scale_y_continuous(breaks = breaks_extended(6),labels = label_dollar())+
  theme(plot.title = element_text(hjust=.5))+
  labs(x="Bedrooms", y="Average Price", title = "Average Price by Number of Bedrooms")+
  stat_summary(geom="bar", fill="red")

ggplot(train, aes(x=as.factor(bedrooms), y=price))+
  geom_boxplot(fill="lightblue")+
  scale_y_continuous(breaks = breaks_extended(6),labels = label_dollar())+
  theme(plot.title = element_text(hjust=.5))+
  labs(x="Bedrooms", y="Price", title = "Price Distribution by Number of Bedrooms")

Bar chart of Average Price by Number of Bathrooms

#bar chart that shows how # of bathrooms is related to price
ggplot(train, aes(x=cut(bathrooms, breaks=c(0,1,2,3,4,5,6,7,8)), y=price))+
  theme(plot.title = element_text(hjust=.5))+
  labs(x="Bathrooms", y="Average Price", title = "Average Price by Number of Bathrooms")+
  scale_y_continuous(breaks = breaks_extended(6),labels = label_dollar())+
  scale_x_discrete(labels = c("0-1", "1-2", "2-3","3-4","4-5", "5-6", "6-7", "7+"))+
  stat_summary(geom="bar", fill="red")

Scatter plot of price against square footage

ggplot(train, aes(x=sqft_living, y=price))+
  scale_y_continuous(breaks = breaks_extended(6),labels = label_dollar())+
  theme(plot.title = element_text(hjust=.5))+
  labs(x="Living Square Footage", y="Price", title = "Price Against Living Square Footage")+
  geom_point()

Boxplots of price by waterfront

ggplot(train, aes(x=waterfront, y=price))+
  scale_y_continuous(breaks = breaks_extended(6),labels = label_dollar())+
  theme(plot.title = element_text(hjust=.5))+
  labs(x="Waterfront", y="Price", title = "Price by Waterfront")+
  geom_boxplot(fill="lightblue")

Barcharts of Price by Grade and Condition

grade <- ggplot(train, aes(x=grade, y=price))+
  scale_y_continuous(breaks = breaks_extended(6),labels = label_dollar())+
  theme(plot.title = element_text(hjust=.5))+
  labs(x="Grade", y="Price", title = "Price by Grade")+
  stat_summary(geom="bar", fill="green")

condition <- ggplot(train, aes(x=condition, y=price))+
  scale_y_continuous(breaks = breaks_extended(6),labels = label_dollar())+
  theme(plot.title = element_text(hjust=.5))+
  labs(x="Condition", y="Price", title = "Price by Condition")+
  stat_summary(geom="bar", fill="green")

grid.arrange(grade, condition, ncol = 2)

Boxplots of Log(Price) by Region

# 'region' was derived from 'zipcode'
ggplot(train, aes(x=region, y=log(price)))+
  scale_y_continuous(breaks = breaks_extended(6))+
  theme(plot.title = element_text(hjust=.5))+
  labs(x="Region", y="log(Price)", title = "log(Price) by Region")+
  geom_boxplot(fill="lavender")

Density plots of Distance to Downtown & Year Built by Region

dist <-ggplot(train, aes(x=distance_to_downtown, color=region))+
        theme(plot.title = element_text(hjust=.5))+
        labs(x="Distance to Downtown (decimal-degrees)", 
            title = "Density Plot of Distance to Downtown")+
        geom_density()

built <- ggplot(train,aes(x=yr_built, color=region))+
          theme(plot.title = element_text(hjust=.5))+
          labs(x="Year Built",
               title = "Density Plot of Year Built")+
          geom_density()

grid.arrange(dist, built, ncol = 2)

Scatterplot of Log(Price) Against Distance to Downtown by Region

ggplot(train, aes(x=distance_to_downtown, y=log(price), color=region))+
  theme(plot.title=element_text(hjust=.5))+
  scale_y_continuous(breaks = breaks_extended(6))+
  labs(x="Distance to Downtown (decimal-degrees)",
       y= "Log(price)",
       color = "Region",
       title = "Log(price) Against Distance to Downtown by Region")+
  geom_point()

Scatterplot of Log(Price) Against Distance to Downtown by Grade

ggplot(train, aes(x=distance_to_downtown, y=log(price), color=grade))+
  scale_y_continuous(breaks = breaks_extended(6))+
  scale_color_gradient(low = "blue", high = "red")+
  theme(plot.title=element_text(hjust=.5))+
  labs(x="Distance to Downtown (decimal-degrees)",
       y= "Log(price)",
       color = "Grade",
       title = "Log(price) Against Distance to Downtown by Grade")+
  geom_point()

Scatterplot of Log(Price) Against Condition by Year Built

ggplot(train,aes(x=condition,y=log(price), color=yr_built))+
  scale_y_continuous(breaks = breaks_extended(6))+
  scale_color_gradientn(colors=c("steelblue", "skyblue", "lightgreen", "gold", "tomato"))+
  theme(plot.title = element_text(hjust=.5))+
  labs(x="Condition",
       y= "log(Price)",
       color = "Year Built",
       title = "log(Price) Against Condition by Year Built")+
  geom_point()

Scatterplot of Log(Price) Against Distance to Downtown by Renovation Group

ggplot(train,aes(x=distance_to_downtown,y=log(price), color=renovation_group))+
  scale_y_continuous(breaks = breaks_extended(6))+
  theme(plot.title = element_text(hjust=.5))+
  labs(x="Distance to Downtown (decimal-degrees)",
       y= "log(Price)",
       color = "Renovated",
       title = "log(Price) Against Distance to Downtown by Renovated")+
  geom_point(alpha=.5)

Linear Regression Analysis

This section presents a linear regression model for predicting Price and outlines the methodology used to develop it.

Define linear regression model with price as response variable and all other variables as predictor variables

# 21 predictor variables
full_model <- lm(price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + waterfront + view + condition + grade + yr_built + yr_renovated + zipcode + lat + long + sqft_living15 + sqft_lot15 + year_sold + month_sold + region + renovation_group + distance_to_downtown, data = train)

summary(full_model)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
##     floors + waterfront + view + condition + grade + yr_built + 
##     yr_renovated + zipcode + lat + long + sqft_living15 + sqft_lot15 + 
##     year_sold + month_sold + region + renovation_group + distance_to_downtown, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1217493   -96073   -10555    76541  4303148 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         4.860e+07  1.131e+07   4.298 1.73e-05 ***
## bedrooms                           -4.144e+04  2.131e+03 -19.440  < 2e-16 ***
## bathrooms                           2.559e+04  3.469e+03   7.376 1.70e-13 ***
## sqft_living                         1.860e+02  3.636e+00  51.156  < 2e-16 ***
## sqft_lot                            2.201e-01  5.601e-02   3.930 8.51e-05 ***
## floors                              4.230e+03  3.527e+03   1.199    0.230    
## waterfront1                         6.362e+05  1.885e+04  33.749  < 2e-16 ***
## view                                5.193e+04  2.278e+03  22.800  < 2e-16 ***
## condition                           2.822e+04  2.529e+03  11.158  < 2e-16 ***
## grade                               8.793e+04  2.322e+03  37.868  < 2e-16 ***
## yr_built                           -1.825e+03  8.191e+01 -22.278  < 2e-16 ***
## yr_renovated                        3.618e+03  6.332e+02   5.714 1.12e-08 ***
## zipcode                            -7.417e+02  3.883e+01 -19.099  < 2e-16 ***
## lat                                 2.832e+05  1.466e+04  19.318  < 2e-16 ***
## long                                4.820e+05  2.394e+04  20.136  < 2e-16 ***
## sqft_living15                       2.023e+01  3.690e+00   5.482 4.26e-08 ***
## sqft_lot15                         -7.703e-02  7.968e-02  -0.967    0.334    
## year_sold                           3.615e+04  5.060e+03   7.144 9.44e-13 ***
## month_sold                          1.051e+03  7.593e+02   1.384    0.166    
## regionSuburb                       -2.805e+04  5.403e+03  -5.192 2.11e-07 ***
## regionRural                        -1.973e+04  4.657e+03  -4.237 2.28e-05 ***
## renovation_groupRecently Renovated -7.196e+06  1.273e+06  -5.653 1.60e-08 ***
## renovation_groupRenovated Long Ago -7.161e+06  1.259e+06  -5.690 1.29e-08 ***
## distance_to_downtown               -1.092e+06  3.172e+04 -34.424  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 193500 on 17264 degrees of freedom
## Multiple R-squared:  0.7239, Adjusted R-squared:  0.7235 
## F-statistic:  1968 on 23 and 17264 DF,  p-value: < 2.2e-16

Note: ‘id’ and ‘date’ were excluded from the full model since ‘id’ is a unique identifier with no predictive value, and ‘date’ contains too many unique values to be effectively represented as dummy variables.

Based on the full model summary, the variables ‘floors’, ‘sqft_lot15’, and ‘month_sold’ are not statistically significant predictors of price and can be removed. Additionally, ‘distance_to_downtown’ renders the geographic coordinates redundant, while ‘region’ captures the variation in ‘zipcode’, allowing both to be excluded.
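The redundancy argument can be checked numerically with variance inflation factors (VIFs). As a minimal hand-rolled sketch (packages such as car provide a vif() function directly), demonstrated on simulated data since the idea is general:

```r
# Manual VIF: regress each variable on all the others; VIF = 1 / (1 - R^2).
# Values much above ~5-10 signal troublesome collinearity.
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    r2 <- summary(lm(reformulate(setdiff(names(df), v), v), data = df))$r.squared
    1 / (1 - r2)
  })
}

# Simulated example: b is nearly a copy of a, while c is independent
set.seed(1)
sim <- data.frame(a = rnorm(100))
sim$b <- sim$a + rnorm(100, sd = 0.1)
sim$c <- rnorm(100)
round(vif_manual(sim), 1)  # a and b show large VIFs; c stays near 1
```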

Influential Observations, High-Leverage Observations, and Outliers for the Full Model

# Compute diagnostics in one cell
ext.student.res <- rstudent(full_model)
student_res_count <- sum(abs(ext.student.res) > 2)

std.res <- rstandard(full_model)
std_res_count <- sum(abs(std.res) > 2)

lev <- lm.influence(full_model)$hat
n <- nrow(train)              # the model was fit on the training set
p <- length(coef(full_model)) # number of estimated coefficients
high_lev_count <- sum(lev > 2 * p / n)

DFFITS_vals <- dffits(full_model)
dffits_count <- sum(abs(DFFITS_vals) > 2 * sqrt(p / n))

COOKS_vals <- cooks.distance(full_model)
cooks_count <- sum(COOKS_vals > 1)

# Summary output
cat("The number of outliers according to studentized and standardized residuals is", 
    student_res_count, "and", std_res_count, 
    "respectively. There are", high_lev_count, "high leverage points,", 
    dffits_count, "influential points according to DFFITS, and", 
    cooks_count, "influential points according to Cook's distance.\n")
## The number of outliers according to studentized and standardized residuals is 592 and 592 respectively. There are 1860 high leverage points, 1162 influential points according to DFFITS, and 0 influential points according to Cook's distance.

Define linear regression model with slightly fewer predictor variables

# 15 predictor variables
model1 <- lm(price ~ bedrooms + bathrooms + sqft_living + sqft_lot + waterfront + view + condition + grade + yr_built + yr_renovated + sqft_living15 + year_sold + region + renovation_group + distance_to_downtown, data = train)

summary(model1)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
##     waterfront + view + condition + grade + yr_built + yr_renovated + 
##     sqft_living15 + year_sold + region + renovation_group + distance_to_downtown, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1204941  -100791   -11106    80187  4290929 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        -5.582e+07  6.638e+06  -8.410  < 2e-16 ***
## bedrooms                           -4.388e+04  2.216e+03 -19.799  < 2e-16 ***
## bathrooms                           3.474e+04  3.536e+03   9.822  < 2e-16 ***
## sqft_living                         1.845e+02  3.786e+00  48.730  < 2e-16 ***
## sqft_lot                            1.774e-01  4.170e-02   4.253 2.12e-05 ***
## waterfront1                         5.975e+05  1.962e+04  30.448  < 2e-16 ***
## view                                4.167e+04  2.356e+03  17.690  < 2e-16 ***
## condition                           2.414e+04  2.610e+03   9.246  < 2e-16 ***
## grade                               9.246e+04  2.387e+03  38.737  < 2e-16 ***
## yr_built                           -1.995e+03  8.179e+01 -24.388  < 2e-16 ***
## yr_renovated                        3.616e+03  6.604e+02   5.475 4.43e-08 ***
## sqft_living15                       3.039e+01  3.826e+00   7.944 2.07e-15 ***
## year_sold                           2.942e+04  3.294e+03   8.931  < 2e-16 ***
## regionSuburb                        3.749e+04  5.140e+03   7.294 3.13e-13 ***
## regionRural                        -5.206e+04  4.702e+03 -11.072  < 2e-16 ***
## renovation_groupRecently Renovated -7.199e+06  1.328e+06  -5.422 5.97e-08 ***
## renovation_groupRenovated Long Ago -7.157e+06  1.313e+06  -5.453 5.03e-08 ***
## distance_to_downtown               -6.328e+05  1.674e+04 -37.793  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 201900 on 17270 degrees of freedom
## Multiple R-squared:  0.6993, Adjusted R-squared:  0.699 
## F-statistic:  2362 on 17 and 17270 DF,  p-value: < 2.2e-16

Fifteen predictor variables are still generally too many for a linear regression model: they increase the risk of overfitting, make the model harder to interpret, and can introduce multicollinearity, which undermines the reliability of the coefficient estimates.

Use regsubsets function to identify best combination of predictor variables

allreg <- regsubsets(price ~ bedrooms + bathrooms + sqft_living + sqft_lot + waterfront + view + condition + grade + yr_built + yr_renovated + sqft_living15 + year_sold + region + renovation_group + distance_to_downtown, data = train, nbest=1)

Adjusted R²

Rewards goodness of fit while penalizing extra predictors

coef(allreg, which.max(summary(allreg)$adjr2))
##          (Intercept)             bedrooms          sqft_living 
##         3480306.4878          -38274.5914             213.9105 
##          waterfront1                 view                grade 
##          598443.4688           45215.5715          100663.8903 
##             yr_built          regionRural distance_to_downtown 
##           -1975.7104          -81514.6571         -550557.6691
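Concretely, Adjusted R² equals 1 - (1 - R²)(n - 1)/(n - p - 1), where p is the number of predictors. A quick sanity check of this formula against summary.lm(), on a simulated fit (the variable names here are illustrative only):

```r
# Adjusted R^2 computed by hand and compared with summary.lm() on a toy fit
set.seed(1)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 2 * d$x1 + rnorm(50)
fit <- lm(y ~ x1 + x2, data = d)

s <- summary(fit)
n <- nrow(d)
p <- 2  # predictors, excluding the intercept
adj_r2 <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)
all.equal(adj_r2, s$adj.r.squared)  # TRUE
```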

Mallows’ Cp

Identifies models with low bias and variance

coef(allreg, which.min(summary(allreg)$cp))
##          (Intercept)             bedrooms          sqft_living 
##         3480306.4878          -38274.5914             213.9105 
##          waterfront1                 view                grade 
##          598443.4688           45215.5715          100663.8903 
##             yr_built          regionRural distance_to_downtown 
##           -1975.7104          -81514.6571         -550557.6691

BIC

Favors simpler models by heavily penalizing overfitting

coef(allreg, which.min(summary(allreg)$bic))
##          (Intercept)             bedrooms          sqft_living 
##         3480306.4878          -38274.5914             213.9105 
##          waterfront1                 view                grade 
##          598443.4688           45215.5715          100663.8903 
##             yr_built          regionRural distance_to_downtown 
##           -1975.7104          -81514.6571         -550557.6691

The predictors selected by Adjusted R², Mallows’ Cp, and BIC are identical, so we can define our final model with confidence.

Define Model with final predictors

# 8 predictor variables
model2 <- lm(price ~ bedrooms + sqft_living + waterfront + view + grade + yr_built + region + distance_to_downtown, data = train)

summary(model2)
## 
## Call:
## lm(formula = price ~ bedrooms + sqft_living + waterfront + view + 
##     grade + yr_built + region + distance_to_downtown, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1225336  -102723   -11980    80524  4244607 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3624927.83  126156.88  28.733   <2e-16 ***
## bedrooms              -39281.91    2179.27 -18.025   <2e-16 ***
## sqft_living              212.55       3.17  67.044   <2e-16 ***
## waterfront1           589822.81   19734.88  29.887   <2e-16 ***
## view                   46319.24    2354.44  19.673   <2e-16 ***
## grade                 100239.55    2284.19  43.884   <2e-16 ***
## yr_built               -2053.39      67.44 -30.447   <2e-16 ***
## regionSuburb           43390.53    5078.93   8.543   <2e-16 ***
## regionRural           -52496.40    4742.72 -11.069   <2e-16 ***
## distance_to_downtown -600755.30   16030.20 -37.476   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 204100 on 17278 degrees of freedom
## Multiple R-squared:  0.6926, Adjusted R-squared:  0.6925 
## F-statistic:  4326 on 9 and 17278 DF,  p-value: < 2.2e-16

Check model meets linear regression assumptions

plot(model2, which = 1) # Linearity

plot(residuals(model2), type = "l", main = "Residuals Plot") # Independence of Errors

plot(model2, which = 3) # Homoscedasticity

plot(model2, which = 2) # Normality of Errors

The residual plots clearly indicate violations of both homoscedasticity and normality assumptions. The non-constant spread of residuals in the Scale-Location plot and the deviation of the Q-Q plot’s tail from the line suggest these issues. As determined from the diagnostic analysis, the 592 outliers identified in the training set likely contribute to these violations.

Remove outliers from train set, define Final Model, and verify assumptions are met

# Remove the 592 rows with the largest absolute standardized residuals
# (matching the 592 outliers identified earlier)
std_res <- rstandard(model2)
outlier_residuals <- order(abs(std_res), decreasing = TRUE)[1:592]
train <- train[-outlier_residuals, ]

# Define Final Model
final_model <- lm(price ~ bedrooms + sqft_living + waterfront + view + grade + yr_built + region + distance_to_downtown, data = train)

summary(final_model)
## 
## Call:
## lm(formula = price ~ bedrooms + sqft_living + waterfront + view + 
##     grade + yr_built + region + distance_to_downtown, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -416578  -84575   -9119   74561  708921 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3154992.38   83116.54   37.96   <2e-16 ***
## bedrooms              -25599.69    1454.72  -17.60   <2e-16 ***
## sqft_living              154.66       2.21   69.97   <2e-16 ***
## waterfront1           584909.84   17873.87   32.72   <2e-16 ***
## view                   45371.29    1609.94   28.18   <2e-16 ***
## grade                  96291.53    1522.16   63.26   <2e-16 ***
## yr_built               -1782.66      44.45  -40.10   <2e-16 ***
## regionSuburb           41911.57    3347.57   12.52   <2e-16 ***
## regionRural           -59862.33    3106.14  -19.27   <2e-16 ***
## distance_to_downtown -457496.64   10671.37  -42.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 131300 on 16686 degrees of freedom
## Multiple R-squared:  0.758,  Adjusted R-squared:  0.7579 
## F-statistic:  5807 on 9 and 16686 DF,  p-value: < 2.2e-16

Final Regression Equation: price = 3154992.38 - 25599.69(bedrooms) + 154.66(sqft_living) + 584909.84(waterfront1) + 45371.29(view) + 96291.53(grade) - 1782.66(yr_built) + 41911.57(regionSuburb) - 59862.33(regionRural) - 457496.64(distance_to_downtown)
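As a quick sanity check on the equation above, predict() can be applied to a hypothetical home. All values below are illustrative, and the factor codings (0/1 for waterfront, Urban/Suburb/Rural for region with Urban as the baseline) are assumptions inferred from the coefficient names in the summary:

```r
# Hypothetical home -- every value here is an illustrative assumption
new_home <- data.frame(
  bedrooms = 3,
  sqft_living = 2000,
  waterfront = factor("0", levels = c("0", "1")),                      # assumed coding
  view = 0,
  grade = 8,
  yr_built = 1990,
  region = factor("Suburb", levels = c("Urban", "Suburb", "Rural")),   # assumed levels
  distance_to_downtown = 0.2
)

# Point prediction with a 95% prediction interval
predict(final_model, newdata = new_home, interval = "prediction")
```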

Check Final Model meets linear regression assumptions

plot(final_model, which = 1) # Linearity

plot(residuals(final_model), type = "l", main = "Residuals Plot") # Independence of Errors

plot(final_model, which = 3) # Homoscedasticity

plot(final_model, which = 2) # Normality of Errors

Despite the initially heavy tails observed in the residual plots, the context of our house price dataset, where significant price variations are expected, allows us to conclude that the linear regression assumptions are met after outlier removal. Specifically: linearity is supported by the random scatter of residuals in the Residuals vs Fitted plot; independence is indicated by the randomness in the residuals plot; homoscedasticity is satisfied by the constant spread of residuals; and normality of errors is supported by the linear alignment of points in the Q-Q plot.
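Formal tests can complement the visual diagnostics if desired. This sketch assumes the lmtest package is installed (bptest() is its Breusch-Pagan implementation); note that with roughly 16,700 observations these tests will flag even trivial deviations, so the plots remain the primary evidence:

```r
library(lmtest)

# Breusch-Pagan test for homoscedasticity (H0: constant error variance)
bptest(final_model)

# Shapiro-Wilk normality test on a subsample (the test caps at 5000 observations)
set.seed(1)
shapiro.test(sample(residuals(final_model), 5000))
```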

Assess Final Model’s predictive ability on test data

# Predict the values using the test data
predictions <- predict(final_model, newdata = test)
actual_prices <- test$price

# Assessment metric: RMSE
mse <- mean((predictions - actual_prices)^2) # MSE
rmse <- sqrt(mse)
print(paste("Root Mean Squared Error (RMSE):", rmse))
## [1] "Root Mean Squared Error (RMSE): 204831.391833837"
# R-squared
rss <- sum((predictions - actual_prices)^2)  # Residual sum of squares
tss <- sum((actual_prices - mean(actual_prices))^2)  # Total sum of squares
rsq <- 1 - rss / tss  # R-squared
print(paste("R-squared:", rsq))
## [1] "R-squared: 0.682522595097385"

The RMSE of approximately 204,832 indicates that, on average, the model’s predicted house prices deviate from actual prices by about $205K. This suggests a relatively high level of error, especially if most homes in the dataset are moderately priced. The R-squared value of 0.683 means the model explains about 68.3% of the variation in house prices, showing moderate predictive power.

Home Quality Analysis

Univariate – Distribution of Grade and Condition

g <- grid.arrange(
  ggplot(train, aes(x = factor(grade))) +
    geom_bar(fill = "steelblue") +
    labs(title = "Distribution of Home Grades",
         x = "Grade (1 = Low Quality, 13 = High Quality)",
         y = "Number of Homes") +
    theme_minimal(),

  ggplot(train, aes(x = factor(condition))) +
    geom_bar(fill = "darkorange") +
    labs(title = "Distribution of Home Conditions",
         x = "Condition (1 = Poor, 5 = Excellent)",
         y = "Number of Homes") +
    theme_minimal(),

  ncol = 2
)

# Pass the arranged grob explicitly; ggsave() alone saves only the last ggplot
ggsave("grade_condition_distribution.png", plot = g, width = 12, height = 5, dpi = 300)
train <- train %>% 
  mutate(good_quality = ifelse(condition > 3 & grade > 7, "yes", "no"))


test <- test %>% 
  mutate(good_quality = ifelse(condition > 3 & grade > 7, "yes", "no"))

Univariate - Class Distribution of Good Quality

ggplot(train, aes(x = factor(good_quality))) +
  geom_bar(fill = "lightsteelblue") +
  theme(plot.title=element_text(hjust=.5))+
  labs(
    title = "Distribution of Good Quality Homes",
    x = "Good Quality (Yes/No)",
    y = "Number of Homes"
  ) 

Bivariate - Distribution of Sale Price by Home Quality Group

ggplot(train, aes(x = factor(good_quality), y = price)) +
  geom_boxplot(fill = "lightblue") +
  theme(plot.title = element_text(hjust=.5)) +
  labs(
    title = "Price Distribution by Home Quality",
    x = "Good Quality (Yes/No)",
    y = "Sale Price"
  ) +
  scale_y_continuous(breaks = breaks_extended(6), labels = label_dollar()) +
  theme_minimal()

Bivariate - Region vs. Good Quality

ggplot(train, aes(x = region, fill = factor(good_quality))) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("lightgray", "steelblue"),
                    labels = c("Not Good Quality", "Good Quality")) +
  labs(
    title = "Proportion of Good Quality Homes by Region",
    x = "Region",
    y = "Proportion of Homes",
    fill = "Home Quality"
  ) +
  theme_minimal()

Bivariate - Distance to Downtown vs. Good Quality

ggplot(train, aes(x = factor(good_quality), y = distance_to_downtown)) +
  geom_boxplot(fill = "darkseagreen") +
  labs(
    title = "Distance to Downtown by Home Quality",
    x = "Good Quality (Yes/No)",
    y = "Distance to Downtown (decimal-degrees)"
  ) +
  theme_minimal()

Multivariate - Price vs. sqft_living, by Good Quality

ggplot(train, aes(x = sqft_living, y = price, color = factor(good_quality))) +
  geom_point(alpha = 0.4) +
  labs(
    title = "Price vs. Living Area Colored by Good Quality",
    x = "Sqft Living",
    y = "Price",
    color = "Good Quality"
  ) +
  scale_y_continuous(labels = label_dollar()) +
  theme_minimal()

Multivariate - Price vs. Distance to Downtown, by Good Quality

ggplot(train, aes(x = distance_to_downtown, y = price, color = factor(good_quality))) +
  geom_point(alpha = 0.5) +
  labs(
    title = "Price vs. Distance to Downtown Colored by Home Quality",
    x = "Distance to Downtown (decimal-degrees)",
    y = "Price",
    color = "Good Quality"
  ) +
  scale_color_manual(values = c("gray70", "steelblue")) +
  scale_y_continuous(breaks = breaks_extended(6), labels = label_dollar()) +
  theme_minimal()

Scatterplot of Log(Price) vs Distance to Downtown

ggplot(train, aes(x = distance_to_downtown, y = log(price))) +
  geom_point(alpha = 0.2, color = "black", size = 1) +   # Light black dots
  geom_smooth(method = "loess", se = FALSE, color = "blue", size = 1.2) +  # Trend line
  labs(title = "Log(Price) vs Distance to Downtown",
       x = "Distance to Downtown Seattle (decimal-degrees)",
       y = "Log(Price)") +
  theme_minimal()

Logistic Regression for Home Quality Classification

Data Preparation for Classification

train$good_quality <- factor(train$good_quality)
test$good_quality <- factor(test$good_quality)

Exploratory Analysis for Classification Variables

chart1<-ggplot(train, aes(x=waterfront, fill=good_quality))+
  geom_bar(position = "fill")+
  labs(x="Waterfront Property", y="Proportion",
       title="Proportion of Good Quality Homes by Waterfront")

chart2<-ggplot(train, aes(x=region, fill=good_quality))+
  geom_bar(position = "fill")+
  labs(x="Region", y="Proportion",
       title="Proportion of Good Quality Homes by Region")

chart3<-ggplot(train, aes(x=renovation_group, fill=good_quality))+
  geom_bar(position = "fill")+
  labs(x="Renovation Group", y="Proportion",
       title="Proportion of Good Quality Homes by Renovation Group")

Proportion of Good Quality Homes by Waterfront

chart1

Proportion of Good Quality Homes by Region

chart2

Proportion of Good Quality Homes by Renovation Group

chart3

Based on these bar charts, roughly two-thirds of waterfront homes are good quality, suburban homes have a higher proportion of good quality homes than rural or urban homes, and relatively few recently renovated homes qualify as good quality.
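The visual impressions above can be checked numerically with row-wise proportion tables on the same train data frame:

```r
# Proportion of good quality homes within each group (each row sums to 1)
prop.table(table(train$waterfront, train$good_quality), margin = 1)
prop.table(table(train$region, train$good_quality), margin = 1)
prop.table(table(train$renovation_group, train$good_quality), margin = 1)
```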

dp1<-ggplot2::ggplot(train,aes(x=log(price), color=good_quality))+
  geom_density()+
  labs(title="Density of Price by Good Quality")+
  theme(plot.title = element_text(size=10, hjust=0.5))

dp2<-ggplot2::ggplot(train,aes(x=sqft_living, color=good_quality))+
  geom_density()+
  labs(title="Density of Square Feet by Good Quality")+
  theme(plot.title = element_text(size=10, hjust=0.5))

dp3<-ggplot2::ggplot(train,aes(x=floors, color=good_quality))+
  geom_density()+
  labs(title="Density of Number of Floors by Good Quality")+
  theme(plot.title = element_text(size=10, hjust=0.5))

dp4<-ggplot2::ggplot(train,aes(x=yr_built, color=good_quality))+
  geom_density()+
  labs(title="Density of Year Built by Good Quality")+
  theme(plot.title = element_text(size=10, hjust=0.5))

dp5<-ggplot2::ggplot(train,aes(x=distance_to_downtown, color=good_quality))+
  geom_density()+
  labs(x="Distance to Downtown (decimal-degrees)",
       title="Density of Distance to Downtown by Good Quality")+
  theme(plot.title = element_text(size=10, hjust=0.5))

dp6<-ggplot2::ggplot(train,aes(x=bedrooms, color=good_quality))+
  geom_density()+
  labs(title="Density of Bedrooms by Good Quality")+
  theme(plot.title = element_text(size=10, hjust=0.5))

gridExtra::grid.arrange(dp1, dp2, ncol = 2, nrow = 1)

gridExtra::grid.arrange(dp3, dp4, ncol = 2, nrow = 1)

gridExtra::grid.arrange(dp5, dp6, ncol = 2, nrow = 1)

Correlation matrix of predictor variables

# Correlation matrix of quantitative predictors
round(cor(train[,c(3:8, 10, 13 ,24)], use= "complete.obs"),3)
##                       price bedrooms bathrooms sqft_living sqft_lot floors
## price                 1.000    0.312     0.517       0.702    0.103  0.290
## bedrooms              0.312    1.000     0.514       0.594    0.036  0.169
## bathrooms             0.517    0.514     1.000       0.738    0.091  0.506
## sqft_living           0.702    0.594     0.738       1.000    0.187  0.353
## sqft_lot              0.103    0.036     0.091       0.187    1.000 -0.011
## floors                0.290    0.169     0.506       0.353   -0.011  1.000
## view                  0.380    0.064     0.155       0.235    0.069  0.016
## yr_built              0.068    0.167     0.527       0.341    0.054  0.501
## distance_to_downtown -0.177    0.116     0.180       0.197    0.275  0.061
##                        view yr_built distance_to_downtown
## price                 0.380    0.068               -0.177
## bedrooms              0.064    0.167                0.116
## bathrooms             0.155    0.527                0.180
## sqft_living           0.235    0.341                0.197
## sqft_lot              0.069    0.054                0.275
## floors                0.016    0.501                0.061
## view                  1.000   -0.056               -0.062
## yr_built             -0.056    1.000                0.436
## distance_to_downtown -0.062    0.436                1.000

Based on the previous visualizations, waterfront, region, renovation group, price, sqft_living, floors, yr_built, distance to downtown, and bedrooms may influence whether a home is deemed good quality.

Define full logistic regression model

full_log <- glm(good_quality ~ price + sqft_living + yr_built + distance_to_downtown + waterfront + region + renovation_group, data = train, family = binomial())

summary(full_log)
## 
## Call:
## glm(formula = good_quality ~ price + sqft_living + yr_built + 
##     distance_to_downtown + waterfront + region + renovation_group, 
##     family = binomial(), data = train)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         4.583e+01  2.360e+00  19.418  < 2e-16 ***
## price                               1.712e-06  1.473e-07  11.623  < 2e-16 ***
## sqft_living                         5.058e-04  4.820e-05  10.494  < 2e-16 ***
## yr_built                           -2.574e-02  1.234e-03 -20.853  < 2e-16 ***
## distance_to_downtown                4.982e-01  3.212e-01   1.551    0.121    
## waterfront1                        -1.445e+00  3.704e-01  -3.901 9.58e-05 ***
## regionSuburb                        1.092e+00  8.882e-02  12.299  < 2e-16 ***
## regionRural                         4.962e-01  8.396e-02   5.910 3.43e-09 ***
## renovation_groupRecently Renovated -3.125e+00  5.129e-01  -6.092 1.12e-09 ***
## renovation_groupRenovated Long Ago -1.037e+00  1.722e-01  -6.024 1.70e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12037  on 16695  degrees of freedom
## Residual deviance: 10344  on 16686  degrees of freedom
## AIC: 10364
## 
## Number of Fisher Scoring iterations: 6

Checking if there is multicollinearity among the predictor variables

vif(full_log)
##                              price                        sqft_living 
##                          25.794900                          27.314048 
##                           yr_built               distance_to_downtown 
##                          21.912356                          24.973604 
##                        waterfront1                       regionSuburb 
##                           8.200852                          29.237259 
##                        regionRural renovation_groupRecently Renovated 
##                          29.235019                          59.679356 
## renovation_groupRenovated Long Ago 
##                          12.143194

There is a high degree of multicollinearity. From the correlation matrix, sqft_living, bedrooms, and bathrooms all appear correlated with one another. Let’s drop distance_to_downtown, which is not significant, along with waterfront and renovation_group, whose VIFs are badly inflated. I’ll conduct a likelihood ratio test to see whether these variables can be dropped from the model, but first let’s compute the accuracy and error rate from this model’s confusion matrix.
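Before the formal test, one quick way to see whether removing those terms relieves the multicollinearity is to refit without them and re-inspect the VIFs (assuming vif() comes from the car package, as above; for models with factor terms car reports generalized VIFs):

```r
# Refit without the flagged predictors and re-check variance inflation factors
reduced_check <- update(full_log, . ~ . - distance_to_downtown - waterfront - renovation_group)
car::vif(reduced_check)
```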

Assess model’s predictive ability on test data

# Predicted probs for test data
preds<-predict(full_log,newdata=test, type="response")

# Confusion matrix with threshold of 0.5
table(test$good_quality, preds>0.5)
##      
##       FALSE TRUE
##   no   3715   80
##   yes   458   69
# Calculate accuracy and error rate
accuracy <- sum(diag(table(test$good_quality, preds > 0.5))) / sum(table(test$good_quality, preds > 0.5))
error_rate <- 1 - accuracy
cat("Accuracy:", round(accuracy * 100, 2), "%\nError Rate:", round(error_rate * 100, 2), "%")
## Accuracy: 87.55 %
## Error Rate: 12.45 %

Here we have what appears to be good accuracy and error rate, but most of the dataset consists of homes that are not good quality, meaning the response classes are imbalanced. Let’s look at ROC and AUC.
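Because of that class imbalance, class-specific rates are more informative than overall accuracy. From the confusion matrix above they can be computed directly (roughly 0.13 sensitivity versus 0.98 specificity, i.e. the model rarely catches good quality homes):

```r
# Class-specific rates from the 0.5-threshold confusion matrix
cm <- table(test$good_quality, preds > 0.5)
sensitivity <- cm["yes", "TRUE"] / sum(cm["yes", ])   # TPR: 69 / (458 + 69)
specificity <- cm["no", "FALSE"] / sum(cm["no", ])    # TNR: 3715 / (3715 + 80)
cat("Sensitivity:", round(sensitivity, 3), "\nSpecificity:", round(specificity, 3))
```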

ROC and AUC

# Produce the numbers associated with classification table
rates<-ROCR::prediction(preds, test$good_quality)

# Store the true positive and false positive rates
roc_result<-ROCR::performance(rates,measure="tpr", x.measure="fpr")

# Plot ROC curve and overlay the diagonal line for random guessing
plot(roc_result, main="ROC Curve for Full Model")
lines(x = c(0,1), y = c(0,1), col="red")

# Compute the AUC
auc<-ROCR::performance(rates, measure = "auc")
auc@y.values
## [[1]]
## [1] 0.7865488

Now we see that the full model performs better than random guessing, with an AUC of about 0.79. I’d like to see if a reduced model would perform better.

Define reduced logistic regression model

reduced_log <- glm(good_quality ~ price + sqft_living + yr_built + region, data = train, family = binomial())

summary(reduced_log)
## 
## Call:
## glm(formula = good_quality ~ price + sqft_living + yr_built + 
##     region, family = binomial(), data = train)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   4.009e+01  2.118e+00  18.927  < 2e-16 ***
## price         1.351e-06  1.264e-07  10.686  < 2e-16 ***
## sqft_living   5.337e-04  4.474e-05  11.929  < 2e-16 ***
## yr_built     -2.273e-02  1.103e-03 -20.606  < 2e-16 ***
## regionSuburb  1.121e+00  8.443e-02  13.278  < 2e-16 ***
## regionRural   4.913e-01  8.004e-02   6.138 8.34e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12037  on 16695  degrees of freedom
## Residual deviance: 10499  on 16690  degrees of freedom
## AIC: 10511
## 
## Number of Fisher Scoring iterations: 5

We see that all predictors in this model are significant.

Assess reduced model’s predictive ability on test data

# Predicted probs for test data
preds<-predict(reduced_log,newdata=test, type="response")

# Confusion matrix with threshold of 0.5
table(test$good_quality, preds>0.5)
##      
##       FALSE TRUE
##   no   3714   81
##   yes   466   61
# Calculate accuracy and error rate
accuracy <- sum(diag(table(test$good_quality, preds > 0.5))) / sum(table(test$good_quality, preds > 0.5))
error_rate <- 1 - accuracy
cat("Accuracy:", round(accuracy * 100, 2), "%\nError Rate:", round(error_rate * 100, 2), "%")
## Accuracy: 87.34 %
## Error Rate: 12.66 %

ROC and AUC

# Produce the numbers associated with classification table
rates<-ROCR::prediction(preds, test$good_quality)

# Store the true positive and false positive rates
roc_result<-ROCR::performance(rates,measure="tpr", x.measure="fpr")

# Plot ROC curve and overlay the diagonal line for random guessing
plot(roc_result, main="ROC Curve for Reduced Model")
lines(x = c(0,1), y = c(0,1), col="red")

auc<-ROCR::performance(rates, measure = "auc")
auc@y.values
## [[1]]
## [1] 0.7818572

As we can see, the AUC for the reduced model (0.782) is slightly lower than for the full model (0.787). Let’s conduct a likelihood ratio test to determine which model to use.

H0: β4 = β5 = β8 = β9 = 0 (the coefficients for distance_to_downtown, waterfront1, and the two renovation_group levels). Ha: at least one of these betas is not zero.
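The test compares the residual deviances of the two models; under H0, the drop in deviance follows a chi-squared distribution with degrees of freedom equal to the number of coefficients dropped (four here):

```latex
\Delta G^2 = D_{\text{reduced}} - D_{\text{full}} \;\overset{H_0}{\sim}\; \chi^2_{4}
```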

# Residual deviances
DR <- reduced_log$deviance  # Deviance of reduced model
DF <- full_log$deviance     # Deviance of full model      

delta_g <- DR - DF

CV <- qchisq(0.95, 4)  # 4 coefficients dropped from the full model

delta_g > CV
## [1] TRUE

So we reject the null hypothesis: the dropped predictors collectively improve the model, and we should therefore go forward with the full model over the reduced one.
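The same conclusion follows from the exact p-value, or more idiomatically from anova(), which performs the deviance comparison directly (both fitted models above are assumed to be in the workspace):

```r
# p-value for the observed drop in deviance (10499 - 10344 = 155 on 4 df)
pchisq(reduced_log$deviance - full_log$deviance, df = 4, lower.tail = FALSE)

# Equivalent built-in comparison
anova(reduced_log, full_log, test = "Chisq")
```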